tidyverse, grouping and formulas
5 Jun 2025
I owe a debt of gratitude to many people, as the thoughts and code in these slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.
Images are either directly linked, or generated with StableDiffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyright materials in academic settings, or that should not be part of the public domain.
Warning
You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:
Materials
Yesterday we covered these topics:
Today we will learn:
planet <- c("Mercury", "Venus", "Earth", "Mars",
            "Jupiter", "Saturn", "Uranus", "Neptune")
planet_type <- c("Terrestrial planet", "Terrestrial planet",
                 "Terrestrial planet", "Terrestrial planet", "Gas giant",
                 "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets <- data.frame(planet_type = factor(planet_type),
                      diameter, rotation, rings,
                      row.names = planet)
planets
planet_type diameter rotation rings
Mercury Terrestrial planet 0.382 58.64 FALSE
Venus Terrestrial planet 0.949 -243.02 FALSE
Earth Terrestrial planet 1.000 1.00 FALSE
Mars Terrestrial planet 0.532 1.03 FALSE
Jupiter Gas giant 11.209 0.41 TRUE
Saturn Gas giant 9.449 0.43 TRUE
Uranus Gas giant 4.007 -0.72 TRUE
Neptune Gas giant 3.883 0.67 TRUE
psych::describe()
vars n mean sd median trimmed mad min max range skew
planet_type* 1 8 1.50 0.53 1.50 1.50 0.74 1.00 2.00 1.00 0.00
diameter 2 8 3.93 4.23 2.44 3.93 2.58 0.38 11.21 10.83 0.69
rotation 3 8 -22.70 91.32 0.55 -22.70 0.69 -243.02 58.64 301.66 -1.65
rings 4 8 NaN NA NA NaN NA Inf -Inf -Inf NA
kurtosis se
planet_type* -2.23 0.19
diameter -1.36 1.49
rotation 1.32 32.29
rings NA NA
psych::describe()
vars n mean sd median trimmed mad min max range skew
diameter 2 8 3.93 4.23 2.44 3.93 2.58 0.38 11.21 10.83 0.69
rotation 3 8 -22.70 91.32 0.55 -22.70 0.69 -243.02 58.64 301.66 -1.65
kurtosis se
diameter -1.36 1.49
rotation 1.32 32.29
With an inner join, we combine two data frames based on a common key. Only the rows with matching keys in both data frames are kept.
With a left join, we keep all rows from the left data frame and only the matching rows from the right data frame. If there is no match, the result will contain NA for the columns from the right data frame.
With a right join, we keep all rows from the right data frame and only the matching rows from the left data frame. If there is no match, the result will contain NA for the columns from the left data frame.
With a full join, we keep all rows from both data frames. If there is no match, the result will contain NA for the columns from the other data frame.
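The four join types can be compared on a pair of small data frames. This is a minimal sketch: the data frames `left_df` and `right_df` and their columns `id`, `x` and `y` are made up for illustration and do not come from the slides.

```r
library(dplyr)

# Hypothetical toy data: the key column `id` overlaps only for 2 and 3
left_df  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right_df <- data.frame(id = c(2, 3, 4), y = c(TRUE, FALSE, TRUE))

# Only rows whose key occurs in both data frames (id 2 and 3)
inner <- inner_join(left_df, right_df, by = "id")

# All rows of left_df; y is NA where right_df has no match (id 1)
left <- left_join(left_df, right_df, by = "id")

# All rows of right_df; x is NA where left_df has no match (id 4)
right <- right_join(left_df, right_df, by = "id")

# All rows of both data frames (id 1, 2, 3 and 4)
full <- full_join(left_df, right_df, by = "id")

inner
left
right
full
```

Stating the key explicitly with `by = "id"` avoids relying on the automatic matching of shared column names.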
tidyverse and the data analysis cycle
Leading principle of the tidyverse: a programming language should really behave like a language.
tidyverse: a few key verbs that perform common types of data manipulation.
The tidyverse packages operate on tidy data:
Each column is a variable
Each row is an observation
Each cell is a single value
Untidy versus tidy data
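As an illustration of the three rules, the sketch below reshapes a small untidy data frame into tidy form. The data frame `untidy` and its columns are hypothetical, and `tidyr::pivot_longer()` is one possible tool for this step; the slides themselves do not show which function they use.

```r
library(tidyr)

# Hypothetical untidy data: the variable "year" is hidden in the column names
untidy <- data.frame(
  id         = c(1, 2),
  score_2023 = c(10, 20),
  score_2024 = c(15, 25)
)

# Tidy version: one column per variable (id, year, score),
# one row per observation, one value per cell
tidy <- pivot_longer(
  untidy,
  cols         = starts_with("score_"),
  names_to     = "year",
  names_prefix = "score_",
  values_to    = "score"
)

tidy
```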
dplyr package
dplyr
The dplyr package is a specialized package for working with data.frames (and the related tibble) to transform and summarize tabular data.
dplyr cheatsheet
dplyr functions
There are many functions available in dplyr, but we will focus on just the following dplyr functions (verbs):
| dplyr verbs | Description |
|---|---|
| glimpse() | a transposed print of the data that shows all variables |
| select() | selects variables (columns) based on their names |
| filter() | subsets the rows of a data frame based on their values |
| arrange() | re-orders or arranges rows |
| mutate() | adds new variables that are functions of existing variables |
| summarise() | creates a new data frame with statistics of the variables (optionally grouped by other variables) |
| group_by() | allows for group operations in the “split-apply-combine” concept |
Check the dplyr cheat sheet for examples.
dplyr::glimpse()
glimpse() is similar to str(), but shows more data.
str() shows more detailed information about data structure.
dplyr::glimpse(planets)
Rows: 8
Columns: 4
$ planet_type <fct> Terrestrial planet, Terrestrial planet, Terrestrial planet…
$ diameter <dbl> 0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883
$ rotation <dbl> 58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67
$ rings <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE
str(planets)
'data.frame': 8 obs. of 4 variables:
$ planet_type: Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
$ diameter : num 0.382 0.949 1 0.532 11.209 ...
$ rotation : num 58.64 -243.02 1 1.03 0.41 ...
$ rings : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
dplyr::mutate()
dplyr::mutate() adds a new variable to the data frame.
Arguments:
.keep specifies which variables to return, “all”, “used”, “unused”, “none”.
.before or .after determine where the new variables are inserted.
dplyr::mutate()
Example: compute a new variable rotation_diameter = rotation / diameter, add it to the data frame and keep all other variables:
Rows: 8
Columns: 5
$ planet_type <fct> Terrestrial planet, Terrestrial planet, Terrestrial …
$ diameter <dbl> 0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.…
$ rotation <dbl> 58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67
$ rings <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE
$ rotation_diameter <dbl> 153.50785340, -256.08008430, 1.00000000, 1.93609023,…
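The output above is consistent with a call along the following lines. This is a sketch, not necessarily the exact code from the slides; the object names `planets_new`, `planets_used` and `planets_after` are assumptions introduced here.

```r
library(dplyr)

# The planets data frame as defined earlier in the slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Default (.keep = "all"): the new variable is appended as the last column
planets_new <- planets %>%
  mutate(rotation_diameter = rotation / diameter)
glimpse(planets_new)

# .keep = "used": keep only the variables used in the computation
planets_used <- planets %>%
  mutate(rotation_diameter = rotation / diameter, .keep = "used")

# .after: insert the new variable directly after rotation
planets_after <- planets %>%
  mutate(rotation_diameter = rotation / diameter, .after = rotation)
```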
The pipe operations do not make changes to the original data set, unless you save the results:
Temporary:
Changes saved in new data frame:
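The difference between a temporary and a saved result can be sketched as follows. The variable `double_diameter` and the object `planets_new` are hypothetical names chosen for this illustration.

```r
library(dplyr)

# The planets data frame as defined earlier in the slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Temporary: the result is printed, but planets itself is unchanged
planets %>%
  mutate(double_diameter = 2 * diameter)
names(planets)

# Saved: assign the result to a new object to keep the new variable
planets_new <- planets %>%
  mutate(double_diameter = 2 * diameter)
names(planets_new)
```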
dplyr::select()
Select variables planet_type and diameter from the planets data frame:
Select numerical variables with where(is.numeric):
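Both selections can be sketched as below. The object names `by_name` and `numeric_only` are assumptions made for this example.

```r
library(dplyr)

# The planets data frame as defined earlier in the slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Select columns by name
by_name <- planets %>%
  select(planet_type, diameter)

# Select every column for which is.numeric() returns TRUE
numeric_only <- planets %>%
  select(where(is.numeric))

names(by_name)
names(numeric_only)
```

Note that `rings` is logical and `planet_type` is a factor, so neither is picked up by `where(is.numeric)`.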
dplyr::filter()
Selects subsets of the rows of a data frame based on their values.
Keep only the planets that have rings and that are gas giants:
Select diameter only for the planets that have rings and that are gas giants:
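A sketch of both steps, assuming the conditions are combined with a logical AND. The object names `ringed_giants` and `ringed_diameter` are introduced here for illustration.

```r
library(dplyr)

# The planets data frame as defined earlier in the slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Conditions separated by a comma are combined with AND
ringed_giants <- planets %>%
  filter(rings, planet_type == "Gas giant")

# Filter rows first, then keep only the diameter column
ringed_diameter <- planets %>%
  filter(rings & planet_type == "Gas giant") %>%
  select(diameter)

ringed_giants
ringed_diameter
```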
To move the row names to a column, you can use rownames_to_column() from the tibble package:
In this case, we want to select the planets that start with the letter “M”. We first use the rownames_to_column() function to move the row names to a column, and then we use filter() with stringr::str_starts() to select the rows:
planets %>%
  rownames_to_column(var = "planet_name") %>%
  filter(stringr::str_starts(planet_name, "M"))
planet_name planet_type diameter rotation rings
1 Mercury Terrestrial planet 0.382 58.64 FALSE
2 Mars Terrestrial planet 0.532 1.03 FALSE
We cannot use starts_with() here, because it only works for column names, not for values in a column.
dplyr::arrange()
Order the rows of the planets data set by ascending values of diameter:
Original data set:
planet_type diameter rotation rings
Mercury Terrestrial planet 0.382 58.64 FALSE
Venus Terrestrial planet 0.949 -243.02 FALSE
Earth Terrestrial planet 1.000 1.00 FALSE
Mars Terrestrial planet 0.532 1.03 FALSE
Jupiter Gas giant 11.209 0.41 TRUE
Saturn Gas giant 9.449 0.43 TRUE
Uranus Gas giant 4.007 -0.72 TRUE
Neptune Gas giant 3.883 0.67 TRUE
Ordered data set, based on diameter:
planet_type diameter rotation rings
Mercury Terrestrial planet 0.382 58.64 FALSE
Mars Terrestrial planet 0.532 1.03 FALSE
Venus Terrestrial planet 0.949 -243.02 FALSE
Earth Terrestrial planet 1.000 1.00 FALSE
Neptune Gas giant 3.883 0.67 TRUE
Uranus Gas giant 4.007 -0.72 TRUE
Saturn Gas giant 9.449 0.43 TRUE
Jupiter Gas giant 11.209 0.41 TRUE
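The ordered data set above could be obtained with a call like the one below. The descending variant with `desc()` is added as an extra illustration and is not part of the slide output.

```r
library(dplyr)

# The planets data frame as defined earlier in the slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Ascending order is the default
ascending <- planets %>%
  arrange(diameter)

# Wrap the variable in desc() for descending order
descending <- planets %>%
  arrange(desc(diameter))

ascending
descending
```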
dplyr
Suppose we want to perform the following transformations:
Sort the rows of planets on ascending values of rotation
Keep only the rows with values > 1
Select the variables planet_type, diameter and rotation
With base R code:
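A minimal sketch of these three steps, first in base R and then as a dplyr pipeline. The source does not preserve which variable the `> 1` condition applies to; `diameter` is assumed here, and the object names are hypothetical.

```r
library(dplyr)

# The planets data frame as defined earlier in the slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Base R: each step needs an intermediate object
sorted      <- planets[order(planets$rotation), ]
large       <- sorted[sorted$diameter > 1, ]   # assumption: condition on diameter
base_result <- large[, c("planet_type", "diameter", "rotation")]

# dplyr: the same steps read top to bottom in one pipeline
dplyr_result <- planets %>%
  arrange(rotation) %>%
  filter(diameter > 1) %>%                     # assumption: condition on diameter
  select(planet_type, diameter, rotation)

base_result
dplyr_result
```

The pipeline avoids the intermediate objects and lists the operations in the order in which they are carried out.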
summarise()
The dplyr function for summarizing data:
mean_diameter sd_diameter
1 3.926375 4.226738
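The column names in the output above suggest a call like the following sketch. The object name `diameter_summary` is an assumption.

```r
library(dplyr)

# The planets data frame as defined earlier in the slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Collapse the eight rows into a single row of summary statistics
diameter_summary <- planets %>%
  summarise(
    mean_diameter = mean(diameter),
    sd_diameter   = sd(diameter)
  )

diameter_summary
```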
mean(), median(), sd(), var(), sum() for numeric variables
n(), n_distinct() for counts
and more (see ?dplyr::select and the cheat sheet)
group_by()
The dplyr function for grouping rows of a data frame is very useful in combination with summarise().
Example: group the planets based on having rings (or not) and compute the mean and the standard deviation for each group.
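A sketch of this grouped summary. The example does not say which variable is summarised; `diameter` is assumed here, and the count column `n_planets` is an extra added for illustration.

```r
library(dplyr)

# The planets data frame as defined earlier in the slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Split by rings, apply the summaries per group, combine into one row per group
by_rings <- planets %>%
  group_by(rings) %>%
  summarise(
    n_planets     = n(),
    mean_diameter = mean(diameter),   # assumption: diameter is the summarised variable
    sd_diameter   = sd(diameter)
  )

by_rings
```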
R
Calculations on data that contain missing values (NA’s) return NA by default in R:
There are two easy ways to perform “listwise deletion”:
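The sketch below shows the behaviour on toy data and two common ways to drop incomplete rows. The vector `x` and data frame `df` are made up, and `na.omit()` and `tidyr::drop_na()` are assumed to be the two ways meant here; the slides do not preserve the original code.

```r
library(tidyr)

# Hypothetical vector with one missing value
x <- c(1, 2, NA, 4)
mean(x)                # NA: the missing value propagates
mean(x, na.rm = TRUE)  # ignores the missing value for this one calculation

# Hypothetical data frame with missing values in two different rows
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# Listwise deletion, option 1: base R
complete_base <- na.omit(df)

# Listwise deletion, option 2: tidyr
complete_tidy <- drop_na(df)

complete_base
complete_tidy
```

Both calls remove every row that has at least one missing value, so only the first row of `df` remains.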
dplyr
Code with a single pipe operator on one line and spaces around %>%:
Code with multiple pipe operators on multiple lines:
but definitely NOT:
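The three layouts might look as follows. The specific calls are illustrative choices, not taken from the slides, and the discouraged variant is shown as a comment only.

```r
library(dplyr)

# The planets data frame as defined earlier in the slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Single pipe: one line, spaces around %>%
one_pipe <- planets %>% select(diameter)

# Multiple pipes: one step per line, each line ending with %>%
many_pipes <- planets %>%
  filter(rings) %>%
  select(diameter)

# Discouraged: no spaces and everything on one line
# planets%>%filter(rings)%>%select(diameter)
```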
tidyverse style guide
Gerko Vink @ Anton de Kom Universiteit, Paramaribo